Received: by cheltenham.cs.arizona.edu; Fri, 28 Feb 1997 08:56:42 MST
To: icon-group@cs.arizona.edu
Date: Thu, 27 Feb 1997 09:59:21 -0500
From: Jan Galkowski <jan@solstice.digicomp.com>
Message-Id: <3315A149.41C67EA6@solstice.digicomp.com>
Organization: Digicomp Research Corporation
Sender: icon-group-request@cs.arizona.edu
References: <331437DE.41C67EA6@solstice.digicomp.com>, <5f1r3u$b7l@nef.ens.fr>
Subject: Re: Icon and two-dimensional matching
Errors-To: icon-group-errors@cs.arizona.edu
Status: RO
Content-Length: 2336

Marc Espie wrote:
>
> In article <331437DE.41C67EA6@solstice.digicomp.com>,
> Jan Galkowski <jan@solstice.digicomp.com> wrote:
> [SMALLER reply though]
> >[LONG post warning!]

[snip]

>
> Icon tables and sets are indeed a very powerful mechanism. However,
> they sometimes suffer from their generality. The current implementation
> uses hash functions that assume a rather smooth average distribution
> of the data.

[snip]

> However, I've also ran into some spectacular failures. You see, the
> sets of words I study are NOT average. Indeed, I am trying to find
> some non obvious correlations between those... and the hashing functions...
> well, I've got cases where everything ended up in one or two buckets,
> or rather larger sets where the hashing process failed due to the
> sheer size of the data.
>
> I would rather like to be able to specify MORE things about my sets
> (average size, distribution of data, other hashing functions) than
> is possible with the current implementation.

Well, hash functions do have the drawback of making assumptions about
the distribution of data. That's part of their nature, and it holds
whether the implementation is Icon, Perl, or APL. (APL doesn't have a
key-table lookup mechanism like Icon and Perl do, but it does use a hash
function to retrieve variable identifiers, and APL programmers sometimes
use a plethora of global variables to achieve the same effect.)

>
> Right now, I have to reinvent the wheel, and reimplement somehow my
> own sets without all the clarity Icon procures to us.

Apart from the address space limitation -- which I'd think you'd
overcome by defining a bigger table than you need -- I'd think the
answer to the poor distribution problem would be to "prehash" the keys
instead of investing a lot of effort in building lookup routines, adding
routines, removal routines, etc. If one can tell that lots of distinct
keys map into the same two buckets, use their distinctions to remap them
into a wider scatter. Why couldn't you use the Icon "map" function to
systematically jumble things up a bit?

> [snip]

I'm curious: What does your post have to do with two-dimensional
matching, or did it just comment on my lead paragraphs?

--
 Jan Theodore Galkowski,
 Developer, tool & numerical methods elf
 Digicomp Research Corporation,
 Ithaca, NY 14850-5720
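
A minimal sketch of the "prehash" idea above, written in Python rather
than Icon so that it runs as-is. Everything in it is an assumption made
for illustration: toy_hash is a deliberately weak stand-in and not
Icon's actual hash function, the bucket count and sample keys are
invented, and prehash is a hypothetical helper that simply reverses the
key so that the part that varies comes first. For fixed-length keys,
Icon's map() can express a similar positional shuffle.

```python
from collections import Counter

NBUCKETS = 8  # arbitrary bucket count, chosen only for the illustration


def toy_hash(key: str) -> int:
    """Deliberately weak hash (an assumption): uses only the first two characters."""
    return sum(ord(c) for c in key[:2]) % NBUCKETS


def prehash(key: str) -> str:
    """Hypothetical prehash: reverse the key so its varying tail is hashed first."""
    return key[::-1]


# Invented sample keys that share a long common prefix: word00 .. word15.
keys = [f"word{i:02d}" for i in range(16)]

# Count how many keys fall into each bucket, with and without the prehash.
raw = Counter(toy_hash(k) for k in keys)
spread = Counter(toy_hash(prehash(k)) for k in keys)

print("buckets used without prehash:", len(raw))
print("buckets used with prehash:   ", len(spread))
```

With these sample keys, every raw key lands in a single bucket, while
the prehashed keys reach all eight. Because reversal is one-to-one,
distinct keys stay distinct; the table still works as long as the same
prehash is applied on both insertion and lookup. A real data set would
need a remapping chosen from where its keys actually differ.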